Data, like the world, can seem chaotic. In order to ask questions, we have to transform the data into useful structures that we and the computer can interact with.

In this workshop we will be using the tidyverse library, a collection of R packages that acts as an extra layer of interaction between base R and the user without a significant impact on performance. If you haven’t installed it yet, do so by copying the following line into the Console panel after the >:

install.packages("tidyverse")

Hit Enter and the download and installation process should start. When it finishes, load the library by executing:

library("tidyverse")
## Registered S3 methods overwritten by 'ggplot2':
##   method         from 
##   [.quosures     rlang
##   c.quosures     rlang
##   print.quosures rlang
## -- Attaching packages -------------------------------------------------------------------------------------------- tidyverse 1.2.1 --
## v ggplot2 3.1.1     v purrr   0.3.2
## v tibble  2.1.3     v dplyr   0.8.1
## v tidyr   0.8.3     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.4.0
## -- Conflicts ----------------------------------------------------------------------------------------------- tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

Above you can see all the libraries contained in the tidyverse being loaded. Some other libraries that might be useful to install are:

install.packages(c("readxl", "psych", "skimr"))

1 Importing data into R

1.1 The source of the data

It is recommended to take care of your folder structure by organising your project with at least three folders: one for your scripts, one for your data and another one for results. To avoid problems with paths:

  1. Create an empty file with the .R extension that will contain your script.
  2. Execute the script using RStudio.
  3. To see which folder your current R session is looking at, execute getwd()

Download the following files

Data extracted from Our World in Data

1.2 The path to the data

Computer locations are structured as layers, one contained in the other. In order to navigate the folder structure we have to know that:

  • ./ Current location
  • ../ One level above the current location. It can be stacked, e.g. ../../
  • / root (usually where the important files for the system are located)
  • ~/ home directory where you can #hygge

You can change this location by giving full directions from the root or relative to the current folder using setwd("./directions/tofolder/inside")
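As a sketch of how these paths work (the data folder below is a hypothetical example, created just for the demonstration):

```r
old_wd <- getwd()                        # where is R currently looking?
dir.create("data", showWarnings = FALSE) # hypothetical project subfolder
setwd("./data")                          # relative path: into data/
setwd("../")                             # one level up, back to where we started
setwd(old_wd)                            # restore the original location
```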

1.3 The nature of the data

Data can come in multiple formats. Look at the file extension of your data file, or open it in a text editor to see how it is formatted. Look at the middle column of the first page of the Data Import Cheat Sheet. Load the data into a variable such as my_data.
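As an illustration, here is a minimal import with readr's read_csv(); the file name and its contents are stand-ins for whatever you downloaded:

```r
library(tidyverse)

# Write a tiny stand-in file (your real file name will differ)
write_lines(c("country,year,value", "Denmark,2018,5.7"), "example.csv")

# read_csv() imports comma-separated files as a tibble;
# read_tsv() and read_delim() cover tab-separated and other delimiters
my_data <- read_csv("example.csv")
```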

1.4 Data frames vs Tibbles

Tables in base R are of class data.frame. Tibbles are an improved version of the data.frame; when files are imported using the read_ functions they are formatted as tibbles. Look at the difference by running the commands as.data.frame(my_data) and as_tibble(my_data)
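Using the built-in mtcars dataset as a stand-in for my_data, you can compare the two classes:

```r
library(tidyverse)

df  <- as.data.frame(mtcars) # base R data.frame: printing shows every row
tbl <- as_tibble(mtcars)     # tibble: prints 10 rows plus the column types

class(df)   # "data.frame"
class(tbl)  # "tbl_df" "tbl" "data.frame" -- a tibble is still a data.frame
```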

2 Column names and formats

When importing tables, the type of data in each column is guessed, but it can also be specified. You can explore your dataset interactively using view() (a new tab opens). Have a glimpse() at the imported dataset and recognise the data type of each column:

Type   Description                    Example
int    integers                       1, 2, 3, 4
dbl    doubles or real numbers        1.0, 2.3, 3.623, 4.78
chr    characters or strings (text)   “Hello”, “wild-type”, “1”
dttm   date-times                     “2018-06-09 16:45:40”
lgl    logical                        TRUE / FALSE
fctr   factors                        1, 1, 2, 3, 4, 4 Levels: 1, 2, 3, 4
date   dates                          “2018-06-09”

Column types can be reformatted at any time.

Whenever possible, avoid spaces in your column names
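As an example of reformatting a column type with mutate(), here the cyl column of the built-in mtcars dataset is turned into a factor (a sketch, not part of the workshop data):

```r
library(tidyverse)

my_cars <- as_tibble(mtcars) %>%
  mutate(cyl = as.factor(cyl)) # reformat a numeric column as a factor

glimpse(my_cars)               # cyl is now reported as <fct>
```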

3 Long vs wide datasets

Tables mainly come in two designs:

  • In Wide format all observations for each sample or subject are contained in one row across multiple columns
  • In Long format (or tidy) each observation is in its own row, and each variable in its own column. There are multiple rows for each sample or subject.

In the middle column of page 2 of the Data Import Cheat Sheet you can find how to tidy your data into the suitable format. In short:

  • gather() to go from wide to long format
  • spread() to go from long to wide format
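A minimal sketch with a made-up table of two subjects measured on two days:

```r
library(tidyverse)

# Wide: one row per subject, one column per day
wide <- tibble(subject = c("A", "B"),
               day1    = c(5.1, 4.8),
               day2    = c(6.0, 5.2))

# Wide to long: the day columns become key/value pairs
long <- wide %>% gather(key = "day", value = "measurement", day1, day2)

# Long back to wide
wide_again <- long %>% spread(key = day, value = measurement)
```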

4 Pipes

One of the best enhancements in R is the pipe. Pipes can be used to concatenate commands using %>%. The pipe passes the result of one function as the first argument of the next function. In RStudio a pipe can also be inserted with Ctrl + Shift + M. Example:

myresults <- mydata %>%
  select(column1, column2, 3:10, -column9) %>%
  filter(column1 < 0.05)

5 Explore your data

There are many things you can do with your dataset. A suggested way of operating would be:

  1. Clarify your questions or the idea of what you want to see
  2. Check the Data Transformation Cheat Sheet to find the functions you need to answer your questions
  3. If the cheat sheet is not enough, search for your problem on the internet, adding tidyverse, dplyr or R at the end of your query

A brief summary of things you can do:

  • select() columns
  • filter() values in columns
  • arrange() your data in ascending or arrange(desc()) in descending order
  • mutate() to create new columns or overwrite existing ones
  • pull() a specific column as a vector
  • rename() columns
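A sketch combining these verbs on the built-in mtcars dataset (the 0.425 factor converts miles per gallon to kilometres per litre):

```r
library(tidyverse)

efficient_cars <- mtcars %>%
  as_tibble() %>%
  rename(weight = wt) %>%       # rename a column
  mutate(kml = mpg * 0.425) %>% # new column from an existing one
  filter(cyl == 4) %>%          # keep only the 4-cylinder cars
  arrange(desc(kml)) %>%        # most efficient first
  select(weight, kml)           # keep just two columns
```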

5.1 Group your data

Considering that your data is in long format you can group your observations based on a specific column using group_by(column_name). This will allow you to perform operations and run functions per group instead of the whole dataset.
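For example, grouping mtcars by the number of cylinders and computing one value per group:

```r
library(tidyverse)

per_cyl <- mtcars %>%
  group_by(cyl) %>%                 # one group per number of cylinders
  summarise(mean_mpg = mean(mpg),   # mean within each group
            n        = n())         # group size
```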

5.2 Combine datasets

Combining tables in the tidyverse is inspired by SQL joins.
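A sketch with two made-up tables sharing an id column:

```r
library(tidyverse)

samples  <- tibble(id = c("s1", "s2", "s3"), value = c(10, 12, 9))
metadata <- tibble(id = c("s1", "s2"), condition = c("wild-type", "mutant"))

# left_join() keeps every row of samples; s3 gets NA for condition
left_join(samples, metadata, by = "id")

# inner_join() keeps only the ids present in both tables
inner_join(samples, metadata, by = "id")
```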

5.3 Summarise your data

Custom summary reports can be created using summarise(). Despite being flexible, this requires a detailed specification of the types of summaries we want to see, such as mean, median, maximum values, etc. Packages like skimr or psych provide a set of out-of-the-box summary statistics for your data. Examples based on the built-in dataset esoph:

library(skimr)
## 
## Attaching package: 'skimr'
## The following object is masked from 'package:stats':
## 
##     filter

esoph %>% skim() %>% print()
## Skim summary statistics
##  n obs: 88 
##  n variables: 5 
## 
## -- Variable type:factor -------------------------------------------------------------------------------------------------------------
##  variable missing complete  n n_unique                         top_counts
##     agegp       0       88 88        6 45-: 16, 55-: 16, 25-: 15, 35-: 15
##     alcgp       0       88 88        4 0-3: 23, 40-: 23, 80-: 21, 120: 21
##     tobgp       0       88 88        4 0-9: 24, 10-: 24, 20-: 20, 30+: 20
##  ordered
##     TRUE
##     TRUE
##     TRUE
## 
## -- Variable type:numeric ------------------------------------------------------------------------------------------------------------
##   variable missing complete  n  mean    sd p0 p25 p50 p75 p100     hist
##     ncases       0       88 88  2.27  2.75  0   0   1   4   17 ▇▂▂▁▁▁▁▁
##  ncontrols       0       88 88 11.08 12.72  1   3   6  14   60 ▇▂▁▁▁▁▁▁
library(psych)
## 
## Attaching package: 'psych'
## The following objects are masked from 'package:ggplot2':
## 
##     %+%, alpha

esoph %>% describe()
## # A tibble: 5 x 13
##    vars     n  mean    sd median trimmed   mad   min   max range   skew
##   <int> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
## 1     1    88  3.39  1.65      3    3.36  1.48     1     6     5 0.0465
## 2     2    88  2.45  1.12      2    2.44  1.48     1     4     3 0.0640
## 3     3    88  2.41  1.12      2    2.39  1.48     1     4     3 0.128 
## 4     4    88  2.27  2.75      1    1.85  1.48     0    17    17 2.20  
## 5     5    88 11.1  12.7       6    8.49  5.93     1    60    59 1.89  
## # ... with 2 more variables: kurtosis <dbl>, se <dbl>

6 Basic data visualization

Finally, a first look at plotting.
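As a minimal sketch with ggplot2 (loaded as part of the tidyverse), plotting weight against fuel consumption in the built-in mtcars dataset:

```r
library(tidyverse)

# ggplot builds a plot in layers: data, aesthetic mappings, then a geometry
p <- ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  labs(x = "Weight (1000 lbs)", y = "Miles per gallon")

p   # printing the object draws the plot
```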




Thanks to the support of: